Abstract: Today's datacenters face significant challenges in providing low-latency, high-quality interactive services that meet users' expectations. To improve application throughput, recent work has embedded application deadline information into the design of network flow scheduling so that latency requirements are met. This raises a critical question: does higher application-level throughput imply better service quality? We note that a query is usually answered by a set of semantically related responses (or flows), some of which are highly correlated with the query while others are not. This observation motivates us to associate the importance of the content with the application flows (or responses) in order to enhance service quality.

We first model the application importance maximization problem in a generic network and in a server-centric network. Since both formulations are too complex to be deployed in the real world, we propose the importance-aware delivery protocol, a distributed, event-driven, rate-based delivery control protocol for server-centric datacenter networks. The proposed protocol exploits the multiple disjoint paths of a server-centric network and jointly considers flow importance, flow size, and deadline to maximize the goodput of the data most semantically related to a query. Simulations based on real and synthetic data show that the proposed protocol significantly outperforms $D^3$ and MPTCP in terms of precision at K and the sum of application-level importance.

I. Introduction 

Over the past few years, deadline-aware services such as web search, online social networks, advertising/recommendation systems, data warehouses, and online retail systems have emerged rapidly. Datacenters, as the critical computing platforms for these ever-growing, high-revenue services, must fulfill their requirements. (1) Most deadline-aware services require sub-second responsiveness at high request rates, because short and timely application latency is the key to satisfying users. This is why soft real-time constraints are imposed on deadline-aware services. (2) Many of these services employ partition-aggregate operations to process, in parallel and in a divide-and-conquer manner [11], massive data sets that are distributed over thousands of servers. These operations may generate all-to-all and all-to-one communication patterns (e.g., MapReduce [18]) at the same time. Hence, the datacenter network must be robust enough to cope with these rapidly changing traffic patterns.

Recent work on datacenter networks [9], [11], [19] has shown that the key issues of current datacenters are the tree-based infrastructure and the traditional TCP-like transport protocol. First, the tree-based infrastructure cannot meet the requirements of high-performance, large-scale datacenters, which may consist of thousands to millions of servers [10]. These requirements include (1) high interconnection bandwidth, i.e., multiple disjoint paths between servers, (2) a low-cost interconnection structure, i.e., only unmodified commodity switches, and (3) low cabling complexity. Server-centric topology designs such as BCube [9] and MDCube [10] have recently been proposed to meet these requirements.

As for the second issue, TCP shares network resources in a laissez-faire manner, which leads to problems such as TCP Incast [16], low bandwidth utilization when multiple paths exist [11], and missed deadlines [7]. Many studies have addressed these problems of traditional TCP. For example, DCTCP [19] and ICTCP [20] were proposed to mitigate TCP Incast. MPTCP [11] was proposed to better utilize multipath infrastructure. HULL [21] trades a little bandwidth to maintain low network latency for soft real-time services. $D^3$ [7] takes flow deadlines into account directly: it proactively suspends some flows to guarantee that the remaining flows meet their deadlines, thereby providing better goodput, i.e., the application-level throughput of flows that successfully meet their deadlines. With these newly proposed infrastructures and protocol designs, the two aforementioned requirements of deadline-aware services, namely the robustness of the datacenter network and the soft real-time constraints, have been investigated and addressed.

However, the robustness of the network infrastructure and the soft real-time constraints on responses cannot by themselves fulfill the requirements of deadline-aware services. According to recent studies [1]-[4], what users care about most is the first few results returned by a deadline-aware service. That is, the quality of the responses, i.e., the precision of the top-K results, may significantly affect user satisfaction and hence the revenue of the business. This also means that the high goodput provided by existing deadline-aware protocols [7] does not guarantee high-quality results for users. For example, in Fig. 1, four flows traverse a bottleneck switch at the same time, and the numbers in each flow indicate the ranks of the response units within the flow. Fig. 1(a) shows that the bottleneck bandwidth cannot carry all flows within their deadlines. Fig. 1(b) illustrates the behavior of $D^3$, which serves requests in a first-come first-served (FCFS) manner. Since the bottleneck bandwidth can support only three flows, one flow is dropped so that the other flows can meet their deadlines. If the dropped flow is the one carrying the responses of rank 1, 3, and 4, the user receives poor-quality results. Fig. 1(c) demonstrates our heuristic protocol: only a few low-rank response units, i.e., ranks 14, 15, and 16, are dropped, and all high-rank response units are delivered before the deadline. That is, even though a network bottleneck exists, the deadline-aware service can still provide high-quality results to users. Hence, we believe that delivering high-rank results is an important issue for high-quality deadline-aware services. Our ideas for the design of an importance-aware datacenter network are as follows:

1) Application-level importance is associated with each response unit, not with flows, while the deadline is associated with flows. A flow (and the response units within it) is valid only if it arrives before the deadline.

2) The flow size may vary from less than 10 KB to more than 500 KB across applications. The protocol must therefore be aware of flow size as well.

3) Some flows are short and so are their deadlines. In addition, the RTT in a datacenter is small. The reaction time of the delivery control protocol must therefore be short enough to cope with short, near-deadline flows, which makes centralized or heavyweight mechanisms impractical.

4) The application-level importance of all applications in a datacenter must be normalized to provide a fairly shared network. This normalization can be measured and performed manually or automatically by the datacenter administrators.

In this work, we exploit application-level content information, i.e., data importance, to meet the most important requirement, i.e., the precision of the top-K results, which affects the user experience of deadline-aware services. To find the optimal solution and the performance upper bound, we first model the datacenter network as a flow network and analyze the application importance maximization problem under the constraints of link capacity, flow conservation, and flow deadline. However, because of its complexity, we do not consider the optimal solution practical.

Therefore, we further propose a novel distributed, rate-based, importance-aware delivery control protocol. The basic idea is to greedily choose the set of flows that can contribute the most importance before their deadlines under the constraint of network capacity. The information about the flows traversing a switch, i.e., the deadline, remaining size, averaged data importance, and sending rate of each flow, is made known to all servers connected to the switch when the flows are initiated. Hence, each server can estimate whether its upcoming flow will contribute more data importance than the existing flows by using a newly proposed metric that jointly considers the remaining size, remaining transmission time, and averaged data importance of a flow, so that less-important flows can be suspended and replaced by the new flow. To utilize the multiple disjoint paths while maximizing application data importance, we choose, among all available disjoint paths to the destination, the combination of paths and rates that affects the least data importance. Note that flow deadlines and the traffic volume in a datacenter vary over time, so some suspended flows may be recovered later and still meet their deadlines. Hence, once a flow completes, its source informs the owners of the suspended flows so that they can check whether any of those flows can be recovered before its deadline. To further improve the delivery ratio of highly important response units, we use a statistical clustering algorithm, i.e., the k-means algorithm, to cluster the responses of a flow into two (or more) smaller flows based on the importance of the responses. As a result, most of the highly important response units of a request are delivered within the deadline, while less-important responses may be dropped, which improves the user-perceived service quality. In summary, the advantages of the proposed protocol include:

1) We design the proposed protocol in a fully distributed 
manner, i.e., no centralized controller is required. 

2) The proposed protocol adaptively adopts the most bene- 
ficial flows on the fly and recovers less-important flows 
when network resources are available. 

3) We split the original flow into two (or more) smaller flows by a statistical clustering algorithm, i.e., the k-means algorithm, to further improve the successful delivery ratio of important response units.

4) The proposed protocol is able to make use of multiple disjoint links and balances the load among them to enhance throughput.

5) No expensive or customized hardware is required, i.e., 
only commodity switches without changes are used. 

The results based on real data show that the proposed protocol provides better application-level, user-perceived performance, i.e., precision at K, than MPTCP [11] and $D^3$ [7]. We therefore believe that our protocol can better fulfill the requirements of deadline-aware services, providing a better user experience and hence more revenue for the providers.

II. Problem Formulation 

In this section, we formulate the application importance maximization problem, whose optimal solution is the maximal sum of the importance of the flows in the network that meet their deadlines. We first model importance-aware datacenter networks and formulate the problem. We then address the complexity of the optimal solution in server-centric networks. In summary, we find that the problem is NP-complete, so no polynomial-time algorithm is known for it. Therefore, we present a distributed heuristic protocol based on ideas similar to those of the optimal solution.

A. Model of Importance-aware Data Center Networks 

We represent a datacenter network as a graph $G=(V,E)$, where $V$ is the set of nodes, i.e., servers or switches, and $E$ is the set of links between nodes. The bandwidth capacity of each link $(u,v) \in E$ is $C(u,v)$. There are $n$ flows $F_1, F_2, \ldots, F_n$ in the datacenter network, and each flow $F_i$ has a source $s_i$, a destination $t_i$, a begin time $b_i$, a deadline $d_i$, and a set of response units $R_i$. For each response unit $r$ in $R_i$, the importance of $r$, i.e., its similarity to the query of $R_i$, is denoted by $m_r$, and the size of $r$ is denoted by $rs_r$. The flow size $Fs_i$ is therefore the sum of the sizes of the response units in $R_i$, i.e., $Fs_i = \sum_{r \in R_i} rs_r$. We denote the averaged importance of $F_i$ by $I_i = \frac{\sum_{r \in R_i} m_r}{\|R_i\|}$. Consequently, we regard a flow in the importance-aware network as a five-tuple, i.e., $F_i = (s_i, t_i, d_i, Fs_i, I_i)$. The transmission rate of $F_i$ on link $(u,v)$ at time $t$ is denoted by $r_i^t(u,v)$.
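
The flow model above can be made concrete with a small data structure. The following is a minimal sketch, assuming that the averaged importance is the plain mean over the response units of a flow; the class and field names (`ResponseUnit`, `Flow`, `avg_importance`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResponseUnit:
    importance: float  # m_r: similarity of the response to the query
    size: float        # rs_r: size of the response unit (e.g., in KB)


@dataclass
class Flow:
    source: str        # s_i
    destination: str   # t_i
    begin: float       # b_i
    deadline: float    # d_i
    responses: List[ResponseUnit]  # R_i

    @property
    def size(self) -> float:
        """Fs_i: sum of the sizes of all response units in the flow."""
        return sum(r.size for r in self.responses)

    @property
    def avg_importance(self) -> float:
        """I_i: mean importance of the response units (0 for an empty flow)."""
        if not self.responses:
            return 0.0
        return sum(r.importance for r in self.responses) / len(self.responses)

    def as_tuple(self):
        """Five-tuple view (s_i, t_i, d_i, Fs_i, I_i) of the flow."""
        return (self.source, self.destination, self.deadline,
                self.size, self.avg_importance)


# Illustrative values only: three response units of decreasing importance.
flow = Flow("S_AD", "S_BE", begin=0.0, deadline=0.02,
            responses=[ResponseUnit(0.9, 10), ResponseUnit(0.5, 20),
                       ResponseUnit(0.1, 30)])
```

Here the flow size is 60 and the averaged importance is 0.5, so a single highly relevant response unit is diluted by the less relevant ones in the same flow.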

B. Global Optimization in Data Center Networks 

In a datacenter network, once a flow cannot meet its deadline, all response units of that flow are regarded as lost. This property makes maximizing application importance difficult. Hence, we split each flow into several tiny flows, each of which contains a single response unit. That is, the new flow set $N'$ consists of $\sum_{i=1}^{n} \|R_i\|$ flows. Since each new flow contains only one response unit, the smallest unit that contributes importance, we do not allow it to be split any further, to guarantee that it can be delivered completely. The new flows can also be defined as five-tuples: $F_i = (s_i, t_i, d_i, rs_i, m_i)$. Let $e_i$ be the end time of flow $F_i$, $T_p$ the earliest begin time among the flows, $T_q$ the latest end time among the flows, and $\Delta_i$ the transmission and queueing delay of flow $F_i$. The application importance maximization problem is then to maximize the sum of the importance $m_i$ of the flows that complete before their deadlines $d_i$, subject to:

1) Link capacity constraint: for each link $(u,v)$ at time $t$, the total flow rate on the link must be less than or equal to the link capacity:

$$\sum_{F_i \in N'} r_i^t(u,v) \le C(u,v), \quad \forall (u,v) \in E,\ \forall t \in [T_p, T_q] \tag{3}$$



2) Flow conservation constraint: at every node, the amount of in-flow must equal the amount of out-flow unless the node is the source or the destination of the flow.

The application importance maximization problem can be reduced to the unsplittable flow problem. In the unsplittable flow problem, an $n$-vertex graph $G(V,E)$ with edge capacities $C_e$ and a set of $k$ vertex pairs $T = \{(s_i, t_i) \mid i = 1, \ldots, k\}$ are given. Each pair $(s_i, t_i)$ in $T$ has a demand $p_i$ and a weight $w_i$. The goal is to find the subset of pairs from $T$ with the maximum total weight such that the entire demand of each selected pair can be routed on its path under the link capacity constraints. Now suppose that all flows in the network have the same begin time and deadline, which is a simplified case of the maximization problem. We can map the set of $N'$ flows to the set of $k$ vertex pairs, the minimal averaged transmission rate $r_i^{min}(s_i, t_i)$ that meets the deadline to the demand $p_i$, and the importance $m_i$ of the flow to the weight $w_i$, which converts the application importance maximization problem into the unsplittable flow problem. That is, the solution of the unsplittable flow problem is the set of flows that maximizes the application importance. It also follows that flows whose demands cannot be satisfied in the unsplittable flow problem should be dropped in advance, because they would reduce the sum of the weights, i.e., the sum of application importance. However, the unsplittable flow problem is NP-complete [12], [13], and solving it requires global information, i.e., a centralized system. These limitations make this solution impractical for datacenter networks.

C. Local Optimization in Server-centric Datacenter Networks 

One of the critical subproblems of the unsplittable flow problem is to find a path for each flow that meets its demand, i.e., the minimal averaged transmission rate. This is difficult because, in a datacenter network, many paths intersect one another at nodes. Paths that share links complicate the allocation of capacity to each flow. Server-centric network architectures not only have many advantages over traditional network architectures but also provide a fixed number of link-disjoint paths between servers. This property helps reduce the complexity of the application importance maximization problem, as explained below.

Our basic idea is to let each server determine whether it can start a new flow by examining the disjoint paths to the destination. If (a) the smallest remaining capacity of the paths is insufficient to meet the deadline and (b) the importance of the new flow is lower than that of every other flow on the paths, the flow is suspended until sufficient capacity becomes available to meet its deadline. If the new flow can contribute more importance than existing flows, the server notifies the less-important flows to suspend and then starts the transmission. That is, each server can decide whether a flow should be transmitted based on partial information about the network.

Once the disjoint paths are given, the application importance maximization problem can be reduced to a multiple knapsack problem. We first map the available capacity $C(u,v)$ of link $(u,v)$ to the capacity $W_{(u,v)}$ of a knapsack. The flow $F_i$ is mapped to the item $x_i$ that is to be placed in a knapsack. The importance $m_i$ of flow $F_i$ is mapped to the value $v_i$ of the item. The minimal averaged transmission rate $r_i^{min}(s_i, t_i)$ that meets the deadline is mapped to the weight $w_i$ of the item. We then define the set of disjoint paths between $s_i$ and $t_i$ that flow $F_i$ may traverse as $L_i$. This set can be treated as a set $K_i$ of knapsack sets for flow $F_i$. A knapsack set in $K_i$ consists of the knapsacks $k_{(s_i,m)}, \ldots, k_{(n,t_i)}$, which represent the available capacity of a disjoint hop-by-hop path from $s_i$ to $t_i$; the set of all edges on this path is denoted by $K_e$ for later use. We use $k_{i,(m,n)}$ to denote this knapsack set and take the smallest available capacity among its knapsacks as its capacity, i.e., $W_{i,(m,n)} = \min_{(u,v) \in K_e} W_{(u,v)}$. We write $x_{i,(m,n)} = 1$ if item $x_i$ is placed in knapsack set $k_{i,(m,n)}$ and $x_{i,(m,n)} = 0$ otherwise; when an item is placed, the available capacity of every knapsack in the set is reduced by $w_i$. Therefore, assuming that $N$ flows are sent from the source node, the local application importance maximization problem becomes a multiple knapsack problem over these items and knapsack sets.

The multiple knapsack problem, however, is known to be NP-complete [13], so it requires a very long computation time, whereas the flows are usually short and numerous. Even if the information of all upcoming flows is known, the problem can be solved only in pseudo-polynomial time, which is still impractical in datacenter networks.
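
To illustrate why the exact local solution does not scale, the following minimal sketch solves a tiny instance of the multiple knapsack view by exhaustive search. It assumes one knapsack per disjoint path whose capacity is the bottleneck capacity of that path, and that a flow is placed on at most one path; the function name `best_assignment` and the numbers are hypothetical.

```python
from itertools import product


def best_assignment(items, capacities):
    """Exhaustively search the placement of items into knapsacks.

    items: list of (value, weight) pairs, i.e., (importance, minimal rate).
    capacities: bottleneck capacity of each disjoint path.
    Returns (best_value, assignment), where assignment[i] is the index of
    the path that carries item i, or None if the item is dropped.
    """
    choices = [None] + list(range(len(capacities)))
    best_value = 0
    best = tuple(None for _ in items)
    # Every item is either dropped or placed on one path: (P + 1)^N cases.
    for assignment in product(choices, repeat=len(items)):
        load = [0] * len(capacities)
        value = 0
        for (v, w), path in zip(items, assignment):
            if path is not None:
                load[path] += w
                value += v
        feasible = all(l <= c for l, c in zip(load, capacities))
        if feasible and value > best_value:
            best_value, best = value, assignment
    return best_value, best


# Illustrative instance: four flows, two disjoint paths.
items = [(9, 60), (5, 50), (4, 40), (1, 30)]
capacities = [60, 90]
best_value, assignment = best_assignment(items, capacities)
```

For this instance the least important flow is dropped and the other three are carried. The search space grows as $(P+1)^N$ for $N$ flows and $P$ paths, which is consistent with the argument that an exact solution is impractical for short, numerous flows.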



III. Importance-aware Delivery Framework for Server-centric Networks

In this section, we describe our heuristic importance-aware delivery protocol for server-centric datacenter networks. The design goals of our protocol are:

1) Maximize application importance, i.e., provide high-quality results for deadline-aware services by finishing the most important flows before their deadlines.

2) Lightweight and fully distributed, i.e., fast enough to cope with datacenter networks and requiring no customized or expensive hardware.

3) High network utilization, i.e., exploit all available static disjoint paths of server-centric networks.

The key insight guiding our protocol design is the following: given (a) the topology information, (b) the size and importance of each response unit in a flow, and (c) the deadline of the flow, each server in a server-centric network can determine (1) which part of the flow is important and (2) the rate needed to meet the deadline constraint. The servers can then decide whether that rate should be allocated according to the importance of the flow and the network status of the paths to the destination.

A. Flow Importance Contribution Metric 

Before introducing our protocol, we must address a fundamental issue: how do we determine whether a flow is important? If all flows begin at the same time and have the same deadline and size, the answer is simple: we use the averaged importance $I_i$ of flow $F_i$ to decide whether it contributes more importance than other flows. In a datacenter network, however, flows arrive and leave all the time, and their deadlines, sizes, and importance differ. Thus, we need a new metric to determine the importance of a flow. Our idea is to measure the importance carried per data unit per time unit. The reason is that the remaining size of a flow shrinks as it is transmitted, yet the importance of the flow counts only when the remaining part is fully delivered. That is, the metric must take both the remaining size and the remaining time of the flow into account, considering temporal and spatial diversity together. Hence, we use the averaged flow importance $I_i$, the remaining size $RS_i$, and the remaining transmission time $RT_i$ to compute the flow importance contribution (FIC) of flow $F_i$, which is defined as:

$$FIC_i = \frac{I_i}{RS_i \cdot RT_i} \tag{9}$$
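
The metric can be evaluated with a few lines of code. The sketch below assumes that the FIC is the averaged importance divided by the product of the remaining size and the remaining transmission time; the function name and the numeric values are hypothetical.

```python
def flow_importance_contribution(avg_importance, remaining_size, remaining_time):
    """FIC_i = I_i / (RS_i * RT_i): importance per data unit per time unit."""
    if remaining_size <= 0 or remaining_time <= 0:
        raise ValueError("remaining size and remaining time must be positive")
    return avg_importance / (remaining_size * remaining_time)


# Two flows with the same averaged importance and remaining time (10 ms),
# but different remaining sizes (100 KB versus 20 KB).
fic_large = flow_importance_contribution(0.6, 100, 10)
fic_small = flow_importance_contribution(0.6, 20, 10)
```

Under this definition the flow with less data left to send has the higher FIC, which reflects the observation that a flow contributes its importance only once its remaining part is fully delivered.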



B. Protocol Overview 



We first give an overview of the protocol. Our protocol is a distributed, event-driven, rate-based delivery control protocol: we assume that the datacenter network assigns each flow a fixed transmission rate on multiple paths until (1) a server tries to initiate a new flow or (2) a flow ends. In the latter case some capacity becomes available, and each server determines whether (a) some of its suspended flows can be awakened, because they are more important and the available capacity is sufficient to meet their deadlines, and (b) some of its flows can increase their transmission rates to shorten their transmission time.

Fig. 2. A simplified 2-level BCube topology

To increase the delivered application importance, we further split a flow into two (or more) smaller flows using a statistical clustering algorithm, e.g., the k-means algorithm, which improves the successful delivery ratio of the more important response units. The details of the protocol are as follows.

1) Distributed Rate Control: In server-centric datacenter networks, several servers are connected by a commodity network component, i.e., a switch, to form a set of servers. These servers may connect to other servers via another network component to form a larger network. Hence, we define the neighbors of a server as the other servers that are connected to it through the same switch. For example, Fig. 2 shows a simple BCube [9] network in which servers $S_{AC}$, $S_{AD}$, $S_{AE}$, and $S_{AF}$ are connected by the same switch $N_A$, and servers $S_{AC}$ and $S_{BC}$ are connected by $N_C$. Therefore, the neighbors of $S_{AC}$ are $S_{AD}$, $S_{AE}$, $S_{AF}$, and $S_{BC}$. With our protocol, each server maintains the remaining capacity of all links to its neighbors via the connected network components. It also maintains information about the flows that traverse its neighbors or itself, including the deadline, the transmission rate, and the averaged importance of each flow. With this information, the servers connected to the same switch can carefully allocate the rates on the links of the switch so as to avoid oversubscribing the capacity of any link, which could build up packet queues at the switch, cause packet loss, and make flows miss their deadlines.
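
The neighbor definition can be sketched as follows, assuming that two servers share a switch exactly when their addresses differ in a single position, as in the $S_{AC}$ example. The function name `neighbors` and the per-position alphabets are hypothetical and merely mirror the labels used in that example.

```python
def neighbors(address, alphabets):
    """List the servers that share a switch with the given server.

    address: tuple of symbols, e.g., ("A", "C") for server S_AC.
    alphabets: allowed symbols for each address position.
    A neighbor differs from the server in exactly one position.
    """
    result = []
    for pos, symbols in enumerate(alphabets):
        for s in symbols:
            if s != address[pos]:
                result.append(address[:pos] + (s,) + address[pos + 1:])
    return result


# Labels taken from the S_AC example; the alphabets are assumptions.
alphabets = [("A", "B"), ("C", "D", "E", "F")]
s_ac_neighbors = neighbors(("A", "C"), alphabets)
```

With these assumed alphabets the result contains $S_{BC}$, $S_{AD}$, $S_{AE}$, and $S_{AF}$, which matches the neighbor set given in the text.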

2) Flow Initialization: Since deadline-aware applications expose the flow size, deadline, and importance information when initializing a flow $F_i$, the server can calculate the flow's importance contribution $FIC_i$ and its minimal transmission rate $r_i^{min}(s_i, t_i) = \frac{Fs_i}{d_i - b_i}$. Then, the server looks up the shortest equal-cost disjoint paths (and the second-shortest paths if only one shortest path exists) to the destination to obtain the next hops.
The rate request is estimated at the server based on the flow information of its neighbors. If the network is under-loaded, spare capacity is allocated until the minimal transmission rate of the flow is satisfied. If the network is over-loaded, the existing flows are asked to release the extra rates they occupy beyond their minimal rates. If the allocated rate is still insufficient to satisfy the flow, the remaining rate is taken from existing flows with lower FIC than $FIC_i$. When the network is lightly loaded, after every flow has been allocated its minimal transmission rate, the remaining capacity is assigned to the flows in proportion to their FICs.

Once the initial rate request succeeds at the server, the request is updated and forwarded to the next hops, i.e., the relay nodes, to allocate the rate from the next hop to the hop after it, and so on. The demanded rate and the available $FIC_i$ of a path, which are recorded in the rate request, are proportional to the estimated available capacity of the paths according to the source's local information, which balances the load among the paths. To maximize application importance by trading less-important flows for more-important ones, we allow a new flow to be initialized only if its FIC is greater than the sum of the FICs of the flows it affects on each node. That is, each server along the path allocates as much rate as the available FIC permits. The rate request is processed and forwarded to the next hop until it reaches a node on the same switch as the destination. Note that on each node the allocation is performed twice, i.e., once for the rate from the node to the network component and once for the rate from the network component to the next hop. The detailed rate allocation algorithm is given in Algorithm 1.

After the server receives the responses to the rate request, it determines the available capacity of each path for flow $F_i$ as the smallest available rate reported by the nodes on that path. If the sum of the smallest available rates of all paths is less than the minimal transmission rate, the flow is not transmitted. Otherwise, the server allocates the flow rate in proportion to the available capacity of each path and broadcasts a rate allocation notification to all its neighbors. The notification is rebroadcast by the relay nodes until it reaches the destination. In addition, because some existing flows may have to be suspended, the on-path node on the same switch as those flows is responsible for informing the sources of the to-be-suspended flows to stop transmitting. This node maintains the information about the suspended flows for later flow recovery. Note that the RTTs in a datacenter network are very small ($\approx 300\,\mu s$), so the protocol must take very little time to allocate rates and suspend existing flows.
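
The admission test and the proportional split across paths can be sketched as follows. This is a minimal illustration under the assumption that each path is summarized by its smallest available rate; the function name `split_rate` and the numbers are hypothetical.

```python
def split_rate(min_rate, path_capacities):
    """Split the minimal transmission rate over the disjoint paths.

    path_capacities: smallest available rate reported on each path.
    Returns None if the paths together cannot carry min_rate (the flow is
    not transmitted); otherwise returns one rate per path, proportional
    to the available capacity of that path.
    """
    total = sum(path_capacities)
    if total < min_rate:
        return None
    return [min_rate * c / total for c in path_capacities]


# Illustrative values in kbps: a 200 kbps flow over two paths.
rates = split_rate(200, [300, 100])
rejected = split_rate(200, [80, 100])
```

In the first call the path with three times the available capacity receives three times the rate (150 and 50 kbps), while the second call returns `None` because the two paths together offer only 180 kbps.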

For example, let $S_{AD}$ be the flow source and $S_{BE}$ the flow destination, and let the minimal transmission rate of the new flow $F_i$ be 120 kbps. The shortest disjoint paths between $S_{AD}$ and $S_{BE}$ are $P1 = \{S_{AD}, (N_D), S_{BD}, (N_B), S_{BE}\}$ and $P2 = \{S_{AD}, (N_A), S_{AE}, (N_E), S_{BE}\}$. First, $S_{AD}$ estimates the available link capacity as $(S_{AD}, S_{BD}) = 180$ kbps and $(S_{AD}, S_{AE}) = 220$ kbps. Since the sum of the available capacities is greater than the minimal transmission rate, the flow can be transmitted to the next hop. $S_{AD}$ then sends rate requests to the relay nodes $S_{BD}$ and $S_{AE}$ to allocate $120 \times \frac{180}{180+220} = 54$ kbps on path $P1$ and hence 66 kbps on path $P2$. Because nodes $S_{BD}$ and $S_{AE}$ are on the same switches as the destination $S_{BE}$, the rate requests need not be forwarded any further. Now suppose that the smallest available rate of path $P1$ is 60 kbps and that of path $P2$ is 90 kbps. $S_{AD}$ then allocates $120 \times \frac{60}{60+90} = 48$ kbps on path $P1$ and hence 72 kbps on path $P2$.

Algorithm 1 Rate Allocation on Each Node

Input: minimal transmission rate r_desired, the link set L_tr to the next hop, available FIC_f of the flow on the path.
Output: allocated rate for the flow r_allocated.
Recorded Info: remaining capacity of the links in L_tr, flow information in L_tr.

while L_tr ≠ ∅ do
    L_cur ← GET-NEXT-LINK(L_tr)
    if remaining link capacity in L_cur > r_desired then
        if remaining link capacity − r_desired > 0 then
            r_add ← assign spare capacity in proportion to FICs
        else
            r_add ← 0
        end if
        r_{L_cur} ← r_desired + r_add
    else
        F ← existing flows traversing through L_cur
        R ← desired rate r_desired
        FIC ← available FIC of the path FIC_f
        while F ≠ ∅ do
            f_i ← EXTRACT-MIN-FIC(F)
            r_i ← transmission rate of the flow f_i
            fic_i ← FIC of the flow f_i
            if FIC < fic_i then
                r_{L_cur} ← r_desired − R
                break
            end if
            R ← R − r_i
            FIC ← FIC − fic_i
        end while
    end if
end while
r_allocated ← MIN(r_{L(*)})
return r_allocated
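
The reclaim step of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than the exact procedure: it assumes that the spare capacity of a link is used before any flow is displaced, that reclaiming stops as soon as the request is covered, and it omits the proportional sharing of extra capacity. The function names and the data layout (tuples of FIC and rate) are hypothetical.

```python
def allocate_on_link(r_desired, spare, flows, fic_budget):
    """Rate that one link can offer to a new flow.

    flows: list of (fic, rate) pairs for the existing flows on the link.
    fic_budget: FIC that the new flow may still displace on this path.
    Returns (granted_rate, displaced_flows).
    """
    if spare >= r_desired:
        return r_desired, []
    granted = spare          # assumption: spare capacity is used first
    budget = fic_budget
    displaced = []
    for fic, rate in sorted(flows):   # least important flow first
        # Stop when the request is covered or the next flow is worth
        # more than the FIC the new flow is still allowed to displace.
        if granted >= r_desired or budget < fic:
            break
        granted += rate
        budget -= fic
        displaced.append((fic, rate))
    return min(granted, r_desired), displaced


def allocate_on_node(r_desired, links, fic_budget):
    """A node can grant only the smallest rate offered by its links."""
    return min(
        allocate_on_link(r_desired, l["spare"], l["flows"], fic_budget)[0]
        for l in links
    )


# Illustrative values: rates in kbps, FICs as plain numbers.
flows = [(0.5, 40), (0.1, 20), (0.3, 50)]
full_grant = allocate_on_link(100, 30, flows, fic_budget=0.5)
partial_grant = allocate_on_link(100, 30, flows, fic_budget=0.3)
links = [{"spare": 120, "flows": []}, {"spare": 30, "flows": flows}]
node_rate = allocate_on_node(100, links, fic_budget=0.3)
```

With a budget of 0.5 the two least important flows are displaced and the full 100 kbps is granted; with a budget of 0.3 only the least important flow can be displaced, so the link offers 50 kbps and the node, limited by its tightest link, grants 50 kbps as well.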

3) Flow Completion and Flow Recovery: When a flow finishes its transmission, a notification is broadcast and forwarded in the same way as in the final step of flow initialization. The notification tells the nodes along the path and all of their neighbors that some network resources have been released. Those nodes can then determine whether (1) some suspended flows can be awakened and (2) the transmission rates of existing flows can be increased. It also triggers the nodes along the path that suspended flows on behalf of the ending flow to inform the sources of those suspended flows that they may recover. The flow recovery and rate adjustment procedure is exactly the same as the flow initialization procedure described above.

The main reason for allowing suspended flows to recover is that flows in a datacenter network have different deadlines. Some suspended flows can therefore be recovered because they have later deadlines. That is, flow recovery further helps to maximize application importance.
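
The feasibility check behind flow recovery can be sketched as follows. This covers only the decision of which suspended flows could still meet their deadlines with the released capacity, assuming that flows are considered in decreasing order of FIC; the complete procedure reuses the flow initialization steps. The function name and the dictionary keys are hypothetical.

```python
def recoverable(suspended, now, available_rate):
    """Pick the suspended flows that can still meet their deadlines.

    suspended: list of dicts with the keys "name", "remaining_size",
    "deadline", and "fic". Flows are considered in decreasing FIC order,
    and each resumed flow reserves the rate it needs to finish in time.
    """
    resumed = []
    for flow in sorted(suspended, key=lambda f: f["fic"], reverse=True):
        time_left = flow["deadline"] - now
        if time_left <= 0:
            continue  # the deadline has already passed
        needed = flow["remaining_size"] / time_left
        if needed <= available_rate:
            resumed.append(flow["name"])
            available_rate -= needed
    return resumed


# Illustrative values: sizes in KB, times in ms, rates in KB/ms.
suspended = [
    {"name": "a", "remaining_size": 200, "deadline": 20, "fic": 0.3},
    {"name": "b", "remaining_size": 100, "deadline": 12, "fic": 0.9},
    {"name": "c", "remaining_size": 50, "deadline": 9, "fic": 0.5},
]
tight = recoverable(suspended, now=10, available_rate=60)
loose = recoverable(suspended, now=10, available_rate=80)
```

Flow "c" has already missed its deadline and is never resumed. Flow "a" has the later deadline, so it needs a lower rate and can be resumed alongside "b" once enough capacity has been released.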

4) Flow Splitting: In Section II, we split each flow into 
tiny flows that each of them contains only one single response 
unit for maximize the application importance. However, it is 
impractical to enforce the same approach in real networks 
because it might generate massive amount of flows. However, 
we notice that in the datacenter, the datasets of deadline-aware 
services are mostly uniform distributed over a lot of servers. 
Therefore, for a general query, the averaged importance of 
the response flows could be similar. This makes it hard to 
distinguish which flow is more important than others. We note 
that for a general query, a server will generate a few results 
with high similarity, and others with low similarity. Figur^ 
shows the importance distribution of the response units within 
the flows. The response unit with deep color represents high 
correlation between this response unit and the corresponding 
query. This result is generated based on the real-world dataset 
from Nil Test Collection for IR Systems (NTCIR) Project[3 
and the dataset are distributed to the hosts based on the data 
placement policy of GFS||6l. Therefore, we believe the phe- 
nomenon can also be found in many deadline-aware services. 
Based on this observation, we employ a statistical clustering
algorithm, i.e., the k-means algorithm, to divide the flow into
several smaller flows. How many parts a flow should be split
into ought to be determined by the application administrators
and is beyond the scope of this work. Here, we simply split
a flow into two smaller flows: one consists of the most
important response units, and the other consists of the
remaining response units. The flow carrying the most important
units has a higher FIC than the other. Thus, the most important
small flows will be transmitted prior to other flows. This
further increases the successful delivery ratio of the most
important response units, and hence helps to maximize
application importance.
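
The two-way split described above can be sketched as a one-dimensional
2-means clustering over the importance values of a flow's response
units. This is only an illustrative sketch: the function name
`split_flow_by_importance`, the centroid initialisation at the minimum
and maximum values, and the tie-breaking rule are assumptions, not
details taken from the protocol.

```python
from typing import List, Tuple


def split_flow_by_importance(importances: List[float],
                             max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Return (important, regular) index lists using 1-D 2-means (hypothetical helper)."""
    if not importances:
        return [], []
    c_low, c_high = min(importances), max(importances)
    if c_low == c_high:
        # Every unit is equally important, so there is nothing to separate.
        return list(range(len(importances))), []

    assignment = None
    for _ in range(max_iter):
        # True means "closer to the high-importance centroid" (ties go to high).
        new_assignment = [abs(v - c_high) <= abs(v - c_low) for v in importances]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        high_vals = [v for v, a in zip(importances, assignment) if a]
        low_vals = [v for v, a in zip(importances, assignment) if not a]
        # Both groups are non-empty because min and max always land in different groups.
        c_high = sum(high_vals) / len(high_vals)
        c_low = sum(low_vals) / len(low_vals)

    important = [i for i, a in enumerate(assignment) if a]
    regular = [i for i, a in enumerate(assignment) if not a]
    return important, regular
```

For example, `split_flow_by_importance([0.9, 0.1, 0.85, 0.2, 0.15, 0.05])`
places units 0 and 2 in the important flow and the remaining four units
in the regular flow.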

IV. Performance Evaluation

We implement a three-level BCube network architecture
with parameter k = 5 as the server-centric datacenter
infrastructure on the event-driven network simulator ns-2 [14].
The proposed protocol, D³ [7], and MPTCP [11] are implemented
as the delivery protocols on the BCube topology as well. The
MPTCP implementation is based on the open-source project
multipath-tcp on ns-2 [15]. As mentioned in [9], a host within
an n-level BCube connects to m switches, and at least m disjoint
paths between any host pair are guaranteed. Hence, the BCube
network consists of 125 hosts, and each host connects to 3
switches. To simulate the commodity switches commonly used
in datacenter networks, we set the packet buffer to 4 MB at each
switch and the packet size to 1 KB, so each switch can buffer
at most 4000 packets before dropping any packet. However,
because the proposed protocol is based on a rate control
mechanism, the queue size stays small (<200) all the
time. The link capacity is 1 Gbps. The round-trip propagation
delay varies from 35 µs to 100 µs depending on the hop count
between the pair of nodes. We follow the deadline setting
in [7], where the mean flow deadline is set to
20 milliseconds, 30 milliseconds, and 40 milliseconds, which
represent tight, moderate, and loose response urgency,
respectively. As mentioned in Section III, each flow is divided
into two smaller parts: an important flow and a regular flow.
The traffic pattern is similar to the traffic generated by
partition-aggregate operations. An aggregating host is first
randomly selected from the network, and the remaining 124 hosts
then generate flows to the aggregating host at the same time.

We use two datasets to conduct the evaluation. The first
one is a synthetic dataset. In this scenario, flow sizes
follow the uniform distribution as in [7]. Flow sizes in the
light-load network are uniformly distributed across
[2KB, 50KB]. Flow sizes across [50KB, 100KB] and across
[100KB, 150KB] represent medium-load and heavy-load
networks, respectively. For the distribution of flow importance,
we assume half of the response units of a flow have high
importance, which is set to 10, and the rest of the response
units have low importance, which is set to 1. The other dataset
consists of a set of real-world articles from the
NII Test Collection for IR Systems (NTCIR) Project [3]. We
distribute the data to the hosts according to the data placement
policy of GFS [6].
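
As a rough illustration of the synthetic workload, the sketch below
generates one partition-aggregate round. The size ranges and the
importance values 10 and 1 come from the text. The number of response
units per flow (`units_per_flow`), the exponential deadline
distribution, and all names are assumptions made only for this example,
since the text specifies only the mean deadline.

```python
import random

# Flow size ranges in KB for each load level, as stated in the text.
LOAD_RANGES_KB = {"light": (2, 50), "medium": (50, 100), "heavy": (100, 150)}
HIGH_IMPORTANCE, LOW_IMPORTANCE = 10, 1


def generate_flows(load, num_senders=124, units_per_flow=10,
                   mean_deadline_ms=30.0, seed=None):
    """Generate one flow per sender for a single aggregating host (hypothetical helper)."""
    rng = random.Random(seed)
    lo_kb, hi_kb = LOAD_RANGES_KB[load]
    flows = []
    for sender in range(num_senders):
        # Half of the response units are highly important, the rest are not.
        n_high = units_per_flow // 2
        unit_importance = ([HIGH_IMPORTANCE] * n_high
                           + [LOW_IMPORTANCE] * (units_per_flow - n_high))
        rng.shuffle(unit_importance)
        flows.append({
            "sender": sender,
            "size_kb": rng.uniform(lo_kb, hi_kb),
            # Assumption: deadlines are exponentially distributed around the mean.
            "deadline_ms": rng.expovariate(1.0 / mean_deadline_ms),
            "unit_importance": unit_importance,
        })
    return flows
```

Calling `generate_flows("heavy", mean_deadline_ms=20.0)` would correspond
to the heavy-load, tight-deadline scenario under these assumptions.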

A. Metrics 

The primary goal of our evaluation is to determine the value
of employing flow importance information to allocate network
capacity. We would also like to verify whether the proposed
protocol is able to exploit path diversity. Thus, the following
metrics are used to conduct the evaluation on the synthetic
dataset.

1) Application-level throughput: the total size of the flows
that meet their deadlines. Only a flow delivered before its
deadline is counted.

2) Application-level aggregated importance: the total amount
of flow importance delivered before the flow deadlines. Flows
with high importance contribute important responses to
users, so the aggregated application importance represents
the quality level that users experience.

3) Ratio of meeting-deadline flows: the ratio of flows
delivered before their deadlines. This result can verify
whether the highly important flows of the proposed protocol
do have a better delivery ratio.
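
The three metrics above can be computed from per-flow results roughly as
follows. This is a minimal sketch: the `FlowResult` record and its field
names are hypothetical, and treating a flow that completes exactly at
its deadline as on time is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlowResult:
    size_kb: float
    importance: float
    deadline_ms: float
    completion_ms: float  # use float("inf") for a flow that never completed

    @property
    def met_deadline(self) -> bool:
        # Assumption: finishing exactly at the deadline counts as on time.
        return self.completion_ms <= self.deadline_ms


def application_throughput(flows: List[FlowResult]) -> float:
    """Total size (KB) of the flows delivered before their deadlines."""
    return sum(f.size_kb for f in flows if f.met_deadline)


def aggregated_importance(flows: List[FlowResult]) -> float:
    """Total importance of the flows delivered before their deadlines."""
    return sum(f.importance for f in flows if f.met_deadline)


def deadline_ratio(flows: List[FlowResult]) -> float:
    """Fraction of flows delivered before their deadlines."""
    if not flows:
        return 0.0
    return sum(1 for f in flows if f.met_deadline) / len(flows)
```

Computing `deadline_ratio` separately over the important flows and the
regular flows gives the per-class comparison described in metric 3.
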
For the evaluation on the real dataset, we use the precision 
at K, which is the fraction of received response units that are 
relevant to the query, to determine if the proposed protocol 
can provide more highly important responses to the users. The 
formal definition of precision at K is: 



$$\text{Precision@}K = \frac{\left|\{\text{Top-}K\ \text{relevant units}\} \cap \{\text{received units}\}\right|}{\left|\{\text{received units}\}\right|}$$
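
A direct reading of this definition is sketched below. The function and
argument names are illustrative, and returning 0.0 when nothing was
received is an assumption, since the definition is undefined for an
empty set of received units.

```python
from typing import Hashable, Iterable, Sequence


def precision_at_k(ranked_units: Sequence[Hashable],
                   received_units: Iterable[Hashable],
                   k: int) -> float:
    """Fraction of received units that belong to the top-K relevant units."""
    top_k = set(ranked_units[:k])   # ground truth: the K most relevant units
    received = set(received_units)  # units that arrived before the deadline
    if not received:
        return 0.0                  # assumption: undefined case mapped to 0
    return len(top_k & received) / len(received)
```

For instance, with the ranking `["d3", "d1", "d7", "d2", "d9"]`, received
units `["d1", "d9", "d3", "d4"]`, and K = 3, two of the four received
units are in the top 3, so the precision is 0.5.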

B. Compared Protocols 

We compare the proposed protocol with D³ [7] and
MPTCP [11]. MPTCP exploits the path diversity in datacenter
networks to achieve high network utilization. It adopts a
TCP-like transport protocol and does not consider flow deadline
and importance. D³ leverages an explicit rate control
mechanism in a semi-distributed fashion to enforce the deadlines
of the flows. Although it takes flow deadlines into account,
D³ is not aware of the path diversity in the datacenter.
Therefore, we modify D³ to randomly select one of the disjoint
paths for each flow to improve the network utilization and
balance the load across paths.

For a fair comparison, we improve D³ and MPTCP by
adding the flow splitting mechanism described in Section
III.B.4. The flow splitting mechanism helps D³ and MPTCP to
enhance the delivery probability of highly important response
units.

Since the original D³ and MPTCP perform worse
than the modified versions, we only present the results of
the modified D³ and MPTCP here due to lack of space.

C. Results of synthetic dataset

Figure 4 gives the application-level throughput of the
protocols. Figure 4(a) shows the goodput in the light-load
network. Both the proposed protocol and D³ can cope with
almost all flows, and both of them outperform MPTCP. MPTCP
aggressively increases its transmission rate until packet loss
occurs. Once a packet is lost, MPTCP needs to wait for a
TCP retransmission timeout (RTO) interval before retransmission.
Because the default value of RTO is usually set to 100ms or
200ms, once a packet of a near-deadline flow gets lost, the flow
will most likely miss its deadline. This is the main reason why
the performance of MPTCP is poor in all cases. D³ has similar
goodput to the proposed protocol at low (medium) load with
loose deadlines. However, as shown in Figure 4(b)(c), when
the network is heavily loaded or when the deadline is
tight, the proposed protocol outperforms D³. This is because
D³ does not exploit path diversity well.

As mentioned in [16], a partition-aggregate operation
generates massive flows, which leads to a huge amount
of traffic on the links to the aggregating host. For example,
when the network is lightly loaded with moderate deadlines, the
total network capacity to the aggregating host must be equal
to or larger than $\frac{26\,\text{KB} \times 124}{30\,\text{ms}} \approx 0.86$ Gbps
for a partition-aggregate operation. The proposed protocol
provides 1.9 times, 3.67 times, and 1.7 times the goodput of D³
with tight, moderate, and loose deadlines, respectively. We
therefore believe the proposed protocol is robust and able to
deal with rapid traffic.
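
The 0.86 Gbps figure above can be reproduced with a short calculation,
assuming it is the mean light-load flow size (26 KB, the midpoint of
[2KB, 50KB]) multiplied by the 124 senders and divided by the 30 ms
moderate deadline, with 1 KB taken as 1000 bytes. The function name is
hypothetical.

```python
def required_capacity_gbps(mean_flow_kb: float, num_flows: int,
                           deadline_ms: float) -> float:
    """Capacity needed to deliver all flows to one host within the deadline."""
    total_bits = mean_flow_kb * 1000 * 8 * num_flows  # assumes 1 KB = 1000 bytes
    deadline_s = deadline_ms / 1000.0
    return total_bits / deadline_s / 1e9


# Light load: flow sizes uniform over [2KB, 50KB], so the mean is 26 KB.
light_moderate = required_capacity_gbps((2 + 50) / 2, 124, 30)
print(f"{light_moderate:.2f} Gbps")
```

Under the same assumptions, tightening the deadline to 20 ms raises the
requirement to about 1.29 Gbps, which is more than a single 1 Gbps link
can carry.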

Figure 5 illustrates the sum of importance contributed by
flows that meet their deadlines. When the network load is heavy,
the proposed protocol provides 3.0 times, 4.64 times, and 1.9
times the data importance of D³ with tight, moderate, and
loose deadlines, respectively. Note that the difference in total
importance is larger than that in goodput. That is, when the
network capacity becomes the critical bottleneck, the proposed
protocol can make good use of the available bandwidth to deliver
highly important flows. Hence, the total data importance is
greater than that produced by conventional deadline-aware
protocols.

To further verify whether the proposed protocol always
produces more data importance, we analyze the delivered flows
at the aggregating host. Figure 6 shows the ratio of flows that
successfully meet their deadlines in the heavy-load network.
The delivery ratio of important flows under the proposed
protocol is significantly higher than under the others. This
shows that our importance-aware protocol can improve the
delivery ratio of the most important flows.

D. Results of NTCIR dataset 

In this scenario, we first use a common retrieval model, i.e.,
the vector space model, to calculate the similarity of each data
unit to each query. Then, we obtain the top-K relevant
rank list of each query. In the rank list of a query, the K
data units that are most similar to the query are recorded
as the ground truth. Hence, we use the data units received
by the aggregating host before the deadline and the rank list to
calculate the precision at K of the query. Figure 7 gives the
results of the precision at K when the deadlines are tight. Not
surprisingly, the proposed protocol significantly outperforms
MPTCP and D³. MPTCP is unable to satisfy the requirements
of deadline-aware services because of the RTO problem. D³
cannot achieve high precision because it does not consider
flow importance. The proposed protocol, as mentioned before,
always delivers the most important flows first, and hence
provides outstanding precision at K all the time.
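
To make the ground-truth construction concrete, the sketch below ranks
data units against a query with a basic vector space model. It assumes
raw term-frequency vectors, whitespace tokenisation, and cosine
similarity. The text does not state which term weighting was used, so
these choices and all names are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_units(query: str, units: Dict[str, str], k: int) -> List[str]:
    """Return the ids of the K data units most similar to the query."""
    q_vec = Counter(query.lower().split())
    scored = [(cosine_similarity(q_vec, Counter(text.lower().split())), uid)
              for uid, text in units.items()]
    # Sort by descending similarity; break ties by unit id for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [uid for _, uid in scored[:k]]
```

The resulting list plays the role of the top-K rank list, against which
the units received before the deadline are compared.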

V. Conclusion 

In this work, we investigate the requirements for providing
high-quality deadline-aware online services in datacenters to
users. Based on observations from the literature and
our own observation of flow contents, we found that fulfilling
deadline constraints and achieving high bandwidth utilization
do not guarantee high service quality. Thus, we exploit
application-level content information, i.e., data importance, to
better fulfill the requirements of deadline-aware services,
providing a better user experience and hence producing more
revenue for the service providers.

To find the optimal solution, we first model the
datacenter network as a flow network for analyzing the
application importance maximization problem under the
constraints of network link capacity, flow conservation, and
flow deadline. Although the property of static multiple disjoint
routing paths, which exists in low-cost server-centric
datacenter infrastructure designs, i.e., BCube [9] and
MDCube [10], significantly reduces the complexity of the
problem, the optimal solution is still impractical for
realistic datacenter networks.

Therefore, based on the ideas of the optimal solution, we
further propose a novel distributed rate-based importance-aware
delivery control protocol. The results show that our
proposed protocol provides significant benefits over even
optimized versions of existing protocols in terms of both
application-level throughput and importance. The results on the
real-world dataset also demonstrate that the proposed protocol
can provide better precision at K all the time. Given the
growing demand for deadline-aware online services, we believe
exploiting data importance to provide high-rank results for
users is an important step toward high-quality deadline-aware
services beyond the state-of-the-art solutions.